Reduce GOP decoder stream synchronizations #33
/build
This parameter enables NvDecoder async allocation/event mode. Without it, NvDecoder::HandlePictureDisplay synchronizes the decoder stream for each displayed frame; enabling it lets this path queue the displayed-frame copies and synchronize the decoder stream after DecProc has queued the output copies. I added a code comment at the constructor site to make that intent explicit.
// Copy the decode frames from device
CUDA_DRVAPI_CALL(cuMemcpyDtoD((CUdeviceptr)pFrame_buffer, (CUdeviceptr)pFrame, decoder->GetFrameSize()));
// Keep the decode-buffer copy on the decoder stream. DecProc performs one
// terminal stream sync before returning the Python-visible frame.
"one terminal stream sync before returning the Python-visible frame" may be misleading. There are other synchronizations inside the decoder's CUDA API calls.
Agreed. I changed the comment to avoid implying this is the only CUDA sync in the decoder path. It now only states that the raw output copy is queued on the decoder stream and that DecProc synchronizes that stream before returning the Python-visible frame.
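To make the ordering in this thread concrete, here is a minimal host-only sketch of the two policies: synchronize after every queued copy, or queue all copies and synchronize once before the caller reads the output. `ModelStream`, `DecodeResult`, and `copy_frames` are hypothetical names for illustration. This is not NvDecoder code and it does not call the CUDA driver API; it only models "work is queued, and a sync drains it".

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for a CUDA stream: queued work runs in order,
// and only when the host synchronizes. NOT the CUDA driver API.
class ModelStream {
public:
    void enqueue(std::function<void()> work) { pending_.push(std::move(work)); }

    // Drain everything queued so far and count the host-side sync.
    void synchronize() {
        ++sync_count_;
        while (!pending_.empty()) {
            pending_.front()();
            pending_.pop();
        }
    }

    std::size_t sync_count() const { return sync_count_; }
    bool idle() const { return pending_.empty(); }

private:
    std::queue<std::function<void()>> pending_;
    std::size_t sync_count_ = 0;
};

struct DecodeResult {
    std::vector<int> frames;  // copied output, one value per frame
    std::size_t syncs;        // number of host-side stream syncs in this model
};

// per_frame_sync = true  : sync after every queued copy.
// per_frame_sync = false : queue all copies, then sync once before returning.
DecodeResult copy_frames(const std::vector<int>& decoded, bool per_frame_sync) {
    ModelStream stream;
    std::vector<int> out(decoded.size(), -1);  // -1 marks "not copied yet"
    for (std::size_t i = 0; i < decoded.size(); ++i) {
        stream.enqueue([&out, &decoded, i] { out[i] = decoded[i]; });
        if (per_frame_sync) stream.synchronize();
    }
    if (!per_frame_sync) stream.synchronize();
    // The output must not be handed to the caller while copies are pending.
    assert(stream.idle());
    return {out, stream.sync_count()};
}
```

In this model, four frames cost four syncs with `per_frame_sync = true` and one sync with `per_frame_sync = false`, and both return the same frames. The count applies to the model only; as noted above, the real decoder path contains other synchronizations.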
0acdde1 to
431e590
/build
PR #33 —
| Build | min (ms) | median (ms) |
|---|---|---|
| async=false (per-frame sync) | 40.8 | 43.5 |
| async=true (PR) | 45.4 | 50.7 |
The two distributions are well separated: the async=true minimum (45.4 ms) exceeds the async=false median (43.5 ms).
Why (nsys, CUDA API call counts)
| API | bEnableAsyncAllocations=true | bEnableAsyncAllocations=false |
|---|---|---|
| cuStreamSynchronize | 1.4k | 64k |
| cuEventRecord | 126k | 63k |
| cuMemcpy2DAsync / cuLaunchKernel | identical | identical |
| cuMemAllocAsync | warm-up only | - |
The flag just swaps ~63k per-frame cuStreamSynchronize calls for ~63k per-frame cuEventRecord calls; no GPU work is removed. The allocation change (the flag's namesake) is irrelevant: buffers are pooled, so there are zero allocations in steady state. Dropping the sync lets the host outrun a 2-surface pool, and the cost reappears as blocking inside the enqueue calls.
431e590 to
91d4ed2
/build
Updated in
Signed-off-by: hongyizhang <805701948@qq.com>
91d4ed2 to
96dfdcc
Performance Analysis:
| Concurrent load | Old (cuMemcpyDtoD) | New (cuMemcpyDtoDAsync) |
|---|---|---|
| None | 5.3 ms | 5.5 ms |
| Non-blocking stream (torch.cuda.Stream()) | 5.4 ms | 5.5 ms |
| Blocking stream (cudaStreamCreate) | 62.6 ms (+1083%) | 5.4 ms (+0%) |
Conclusion
- Typical PyTorch training (streams are cudaStreamNonBlocking by default): no performance difference, since the null stream global barrier does not apply to non-blocking streams.
- Blocking stream workloads (legacy NCCL, cuBLAS global stream, custom CUDA extensions): old code inflates decoder latency ~10×; new code is unaffected.
- Correctness: the new cuStreamSynchronize eliminates a race condition where the last frame's buffer could be read before its copy completed.
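The reasoning behind the first two conclusions can be sketched as a small host-only latency model. It assumes the old cuMemcpyDtoD path runs on the legacy default (null) stream, which waits for work on blocking streams in the same context but not for non-blocking ones, and that a copy queued on the decoder's own stream does not wait on other streams. `OtherStream` and `copy_latency_ms` are hypothetical names and the numbers in the usage note are illustrative inputs, not measurements from this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical description of another stream in the same context.
struct OtherStream {
    double queued_work_ms;  // GPU work already queued on that stream
    bool non_blocking;      // true if created with a non-blocking flag
};

// Model of the time until a copy of `copy_ms` completes.
// use_legacy_default_stream = true  : assumed behaviour of the old path.
// use_legacy_default_stream = false : copy queued on the decoder's own stream.
double copy_latency_ms(double copy_ms,
                       const std::vector<OtherStream>& others,
                       bool use_legacy_default_stream) {
    assert(copy_ms >= 0.0);
    double wait_ms = 0.0;
    if (use_legacy_default_stream) {
        // The null stream waits for every blocking stream's queued work,
        // and skips streams marked non-blocking.
        for (const OtherStream& s : others) {
            if (!s.non_blocking) {
                wait_ms = std::max(wait_ms, s.queued_work_ms);
            }
        }
    }
    return wait_ms + copy_ms;
}
```

With an illustrative 5 ms copy and one other stream holding 50 ms of queued work, the model returns 55 ms on the legacy default stream when that stream is blocking, and 5 ms in the other cases (non-blocking load, or copy on the decoder stream). That is the same shape as the table above, though the model ignores everything else the decoder does.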
Summary
Verification
git diff --check
Discussion
This draft mirrors the internal MR diff for discussion. One point to review is that enabling bEnableAsyncAllocations activates the existing async allocation path in NvDecoder.